Add DeepSeek V3.2 support #3270
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request introduces a new Sparse Flash Attention (SFA) mechanism, primarily for DeepSeek V2 models on Ascend hardware. The changes are extensive, adding new attention backends, modifying model definitions, and updating distributed communication logic to support the new KV cache structure for SFA. While the implementation adds significant new functionality, I have identified a critical issue in multi-stream metadata handling that could lead to incorrect attention calculations. Additionally, there are several instances of hardcoded configuration values that make the implementation brittle and prone to errors with different model variants. I've also flagged a few `TODO` comments in critical code paths that suggest the logic may be incomplete or unverified. Addressing these points is crucial for the stability and correctness of the new SFA feature.
return model_input_split_v1_mla_attn(
    ms_split_config=ms_split_config,
    attn_metadata=self,
    _metadata_cls=AscendMLAMetadata,
)
The `split_metadata_for_multistream` method is incorrectly using `AscendMLAMetadata` as the `_metadata_cls` when splitting metadata. Since this method is part of `AscendSFAMetadata`, it should use its own class, `AscendSFAMetadata`, to ensure the correct metadata structure is propagated in multi-stream scenarios. Using the wrong metadata class can lead to incorrect attention calculations or runtime errors.
Suggested change:

  return model_input_split_v1_mla_attn(
      ms_split_config=ms_split_config,
      attn_metadata=self,
-     _metadata_cls=AscendMLAMetadata,
+     _metadata_cls=AscendSFAMetadata,
  )
prefill_q_pe = torch_npu.npu_interleave_rope(
    prefill_q_pe, cos_q, sin_q)  # BNSD
prefill_q_pe = prefill_q_pe.squeeze(2)  # BSH
# q[..., self.qk_nope_head_dim:] = prefill_q_pe  # TODO:????
This line contains a `TODO:????` comment, which indicates uncertainty about the implementation. This part of the code seems related to applying rotary position embeddings. Leaving such a TODO suggests the logic might be incomplete or incorrect, which could lead to erroneous attention results. This should be investigated and resolved before merging.
def mla_epilog(self,
               attn_output: torch.Tensor = None,
               absorb: bool = False):
    # TODO: need to check
A `TODO: need to check` comment is present in the `mla_epilog` function. This indicates that the logic in this function may not have been fully verified. Unchecked code, especially in critical paths like the attention epilogue, poses a risk and should be reviewed and confirmed to be correct before merging.
# TODO(zzzzwwjj): wait transformers add these params
self.n_heads: int = 64  # 64
self.head_dim: int = 128  # 128
self.index_topk: int = 2048  # 2048
The parameters `n_heads`, `head_dim`, and `index_topk` for the indexer are hardcoded. This makes the implementation brittle and may cause silent errors or incorrect behavior if used with model variants that have different configurations. These values should be derived from the model `config` object rather than being hardcoded.
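One possible shape for that, shown only as a hedged sketch: read the dimensions from the HF config and keep the current numbers as fallbacks. The attribute names (`index_n_heads`, `index_head_dim`, `index_topk`) are assumptions, since transformers has not exposed these fields yet.

```python
# Sketch only: derive the indexer dimensions from the model config instead of
# hardcoding them. The attribute names below are hypothetical placeholders for
# whatever fields transformers eventually adds for the DeepSeek V3.2 indexer.
def resolve_indexer_dims(config):
    n_heads = getattr(config, "index_n_heads", 64)      # fallback: today's value
    head_dim = getattr(config, "index_head_dim", 128)   # fallback: today's value
    index_topk = getattr(config, "index_topk", 2048)    # fallback: today's value
    return n_heads, head_dim, index_topk
```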
def __init__(self,
             config,
             dim: int = 7168,
             n_heads: int = 64,
             head_dim: int = 128,
             index_topk: int = 2048,
             q_lora_rank: int = 1536,
             rope_head_dim: int = 64,
             quant_config: Optional[QuantizationConfig] = None,
             prefix: Optional[str] = ""):
The `Indexer` class `__init__` method uses hardcoded default values for several dimensional parameters such as `dim`, `n_heads`, `head_dim`, etc. This is risky as it might not match the actual model configuration passed during instantiation, leading to silent errors or incorrect behavior. These parameters should be passed explicitly from the model's configuration during initialization, and the defaults should be used cautiously.
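For illustration, a hedged sketch of constructing the class from config-derived values rather than the defaults; `resolve_indexer_dims` is the hypothetical helper from the earlier sketch, and the surrounding names (`quant_config`, `prefix`) are assumed to be in scope at the call site.

```python
# Sketch only: pass resolved config values explicitly instead of relying on the
# hardcoded defaults in Indexer.__init__.
n_heads, head_dim, index_topk = resolve_indexer_dims(config)
self.indexer = Indexer(
    config,
    dim=config.hidden_size,
    n_heads=n_heads,
    head_dim=head_dim,
    index_topk=index_topk,
    q_lora_rank=config.q_lora_rank,
    quant_config=quant_config,
    prefix=f"{prefix}.indexer",
)
```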
sfa_bytes = 128 * self.block_size * get_dtype_size(
    self.dtype) if self.use_sfa else 0
The value `128` is hardcoded in the `sfa_bytes` calculation. This value appears to correspond to the indexer's head dimension. Using magic numbers reduces code maintainability and can lead to bugs if the indexer's configuration changes. This value should be retrieved from a configuration source or defined as a named constant to improve clarity and robustness.
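A minimal sketch of the named-constant variant, dropping into the existing method; the constant and its value mirror the current hardcoded number, and whether it should instead come from the model config is left open.

```python
# Sketch only: give the magic number a name so the cache-size math stays tied
# to the indexer's head dimension in one obvious place.
SFA_INDEXER_HEAD_DIM = 128  # assumption: equals the indexer head_dim used above

sfa_bytes = (SFA_INDEXER_HEAD_DIM * self.block_size *
             get_dtype_size(self.dtype)) if self.use_sfa else 0
```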
self.dim: int = config.hidden_size  # 7168
# TODO(zzzzwwjj): wait transformers add these params
self.n_heads: int = 64  # 64
self.head_dim: int = 128  # 128
self.index_topk: int = 2048  # 2048
The parameters `dim`, `n_heads`, `head_dim`, and `index_topk` for the indexer are hardcoded. This makes the implementation brittle and may cause issues if used with model variants that have different configurations. These values should be derived from the model `config` object instead of being hardcoded to ensure the implementation is robust.
Code Review
This pull request introduces a new Sparse Flash Attention (SFA) backend for Ascend NPUs, specifically for the DeepSeek-V2 model. The changes include the core SFA implementation, metadata builders, torchair integration, and patches to vLLM to select and configure this new backend. While the implementation is extensive, I've found several critical issues related to incorrect path selection between prefill and decode operations in mixed batches, which could lead to incorrect attention results. There are also some maintainability concerns with hardcoded values. Addressing these issues is crucial for the correctness and robustness of the new backend.
def apply_attention_fusion(self, query_states, key_states, topk_indices,
                           attn_metadata: M):
    # repeat k/v heads if n_kv_heads < n_heads
    q_nope, q_pe = query_states
    k_nope, k_rope = key_states

    if attn_metadata.prefill is not None:

        prefill_metadata = attn_metadata.prefill

        slc_fa_fusion = torch.ops.custom.npu_sparse_flash_attention(
            query=q_nope,
            key=k_nope,
            value=k_nope,
            sparse_indices=topk_indices,
            scale_value=self.scale,
            sparse_block_size=1,
            block_table=prefill_metadata.block_table,
            actual_seq_lengths_query=prefill_metadata.query_lens,
            actual_seq_lengths_kv=prefill_metadata.seq_lens,
            query_rope=q_pe,
            key_rope=k_rope,
            layout_query="TND",
            layout_kv="PA_BSND",
            sparse_mode=3,
        )

    elif attn_metadata.decode is not None:
The logic to distinguish between prefill and decode paths is incorrect. In a mixed batch with both prefill and decode requests, `attn_metadata.prefill` will be non-None. This will cause the `decode_attn_output` computation to incorrectly enter the prefill branch, leading to wrong attention results. The `elif` for the decode path will never be reached in a mixed batch.

To fix this, add a boolean flag, e.g. `is_prefill`, to the `apply_attention_fusion` function signature, then use this flag to select the correct code path:
def apply_attention_fusion(self, query_states, key_states, topk_indices,
                           attn_metadata: M, is_prefill: bool):
    # ...
    if is_prefill:
        prefill_metadata = attn_metadata.prefill
        assert prefill_metadata is not None
        # ... prefill logic
    else:  # is_decode
        decode_metadata = attn_metadata.decode
        assert decode_metadata is not None
        # ... decode logic
You will also need to update the call sites in the `forward` method to pass this flag (a rough sketch follows this list):

- `is_prefill=False` for the decode call (line 827).
- `is_prefill=True` for the prefill call (line 835).
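A hedged sketch of what those two call sites might look like; the surrounding variable names are invented for illustration, and the line numbers refer to the reviewed diff.

```python
# Illustrative only: thread the new flag through both call sites in forward().
# Variable names here are placeholders, not the PR's actual identifiers.
decode_attn_output = self.apply_attention_fusion(
    decode_query_states, decode_key_states, decode_topk_indices,
    attn_metadata, is_prefill=False)   # decode call (around line 827)

prefill_attn_output = self.apply_attention_fusion(
    prefill_query_states, prefill_key_states, prefill_topk_indices,
    attn_metadata, is_prefill=True)    # prefill call (around line 835)
```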
def indexer_select(
    self,
    x: torch.Tensor,
    qr: torch.Tensor,
    kv_cache: Tuple[torch.Tensor, torch.Tensor, torch.Tensor],
    attn_metadata: M,
):
This function has the same issue as `apply_attention_fusion`. The `if attn_metadata.prefill is not None:` check is not sufficient to distinguish between prefill and decode paths in a mixed batch. This will cause incorrect metadata to be used for decode requests.

Please add an `is_prefill` flag to the function signature and use it to correctly select the metadata:
def indexer_select(
    self,
    x: torch.Tensor,
    qr: torch.Tensor,
    kv_cache: Tuple[torch.Tensor, torch.Tensor, torch.Tensor],
    attn_metadata: M,
    is_prefill: bool,
):
    if is_prefill:
        prefill_metadata = attn_metadata.prefill
        assert prefill_metadata is not None
        cos = prefill_metadata.cos
        # ...
    else:  # is_decode
        decode_metadata = attn_metadata.decode
        assert decode_metadata is not None
        cos = decode_metadata.cos
        # ...
The call sites in `_sfa_preprocess` should be updated to pass `is_prefill=False` for the decode path (line 690) and `is_prefill=True` for the prefill path (line 778).
# Profiling run.
return output

if attn_metadata.prefill is not None:
The condition `if attn_metadata.prefill is not None:` is incorrect for handling mixed batches. In a batch with both prefill and decode requests, this condition will always be true, and the decode path in the `elif` block (line 1099) will never be reached. This will cause decode requests to be processed with prefill logic, which is incorrect.

The `AscendSFATorchairMetadataBuilder` sets `is_prefill` and `is_decode` flags. Please use `if attn_metadata.is_prefill:` here, and `elif attn_metadata.is_decode:` for the decode block, to ensure correct path selection.
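A hedged sketch of the corrected branching; the helper names stand in for the existing prefill/decode bodies and are not the PR's actual identifiers.

```python
# Sketch only: branch on the flags set by AscendSFATorchairMetadataBuilder
# instead of testing whether the prefill metadata happens to be present.
if attn_metadata.is_prefill:
    output = self._run_prefill(query, kv_cache, attn_metadata)   # placeholder
elif attn_metadata.is_decode:
    output = self._run_decode(query, kv_cache, attn_metadata)    # placeholder
```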
):
    if attn_metadata.prefill is not None:
        cos = attn_metadata.prefill.cos
        sin = attn_metadata.prefill.sin
    elif attn_metadata.decode is not None:
        cos = attn_metadata.decode.cos
        sin = attn_metadata.decode.sin
The `is_prefill` parameter is passed to this function but is not used. Instead, the function uses `if attn_metadata.prefill is not None:`, which is incorrect for mixed batches and will lead to using prefill metadata for decode requests.

Please refactor this function to use the `is_prefill` flag to correctly select the metadata from either `attn_metadata.prefill` or `attn_metadata.decode`.
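A minimal sketch of that refactor, keeping the original attribute accesses and only switching the branch condition to the flag that is already passed in:

```python
# Sketch only: honour the is_prefill argument rather than re-deriving the
# branch from whichever metadata happens to be non-None.
if is_prefill:
    assert attn_metadata.prefill is not None
    cos = attn_metadata.prefill.cos
    sin = attn_metadata.prefill.sin
else:
    assert attn_metadata.decode is not None
    cos = attn_metadata.decode.cos
    sin = attn_metadata.decode.sin
```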
sfa_bytes = 128 * self.block_size * get_dtype_size(
    self.dtype) if self.use_sfa else 0
The value `128` is hardcoded for calculating `sfa_bytes`. This seems to correspond to the `head_dim` of the SFA indexer. Hardcoding this value makes the code less maintainable and harder to adapt to models with different indexer dimensions.

This value should be retrieved from the model configuration and passed to `AttentionSpec` instead of being hardcoded.
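A hedged sketch of that direction; the exact plumbing is an assumption, the attribute name `index_head_dim` is hypothetical, and `hf_config` is assumed to be reachable wherever this cache-size math lives.

```python
# Sketch only: look the indexer head dim up once instead of repeating the
# literal 128. The config attribute name here is a placeholder.
indexer_head_dim = getattr(hf_config, "index_head_dim", 128)
sfa_bytes = (indexer_head_dim * self.block_size *
             get_dtype_size(self.dtype)) if self.use_sfa else 0
```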
Code Review
This pull request introduces support for a new Sparse Flash Attention (SFA) backend, primarily for DeepSeek V2 models on Ascend NPUs. The changes are extensive, adding new attention implementations for both eager mode (`sfa_v1.py`) and torch.compile/torchair mode (`torchair_sfa.py`), along with necessary modifications across the model definition, configuration, and distributed components. While the implementation is comprehensive, I've identified several critical and high-severity issues related to correctness and maintainability, including incorrect metadata class usage, type mismatches, hardcoded values, and significant code duplication. These issues should be addressed to ensure the stability and correctness of the new SFA backend. The pull request description is also missing, which would be very helpful for understanding the context and testing of these complex changes.
UT passed here: #3277
What this PR does / why we need it?
This PR adds initial DeepSeek V3.2 support on top of vLLM v0.11.0 (not released yet). We will complete the vLLM adaptation as soon as possible; this feature should be ready within the next 1-2 days.
Related doc: #3223.
Does this PR introduce any user-facing change?
Yes!
How was this patch tested?
CI passed, and we will run through the DeepSeek doc soon.